A Lazy, Self-optimizing Parallel Matrix Library
Abstract
This paper describes a parallel implementation of a matrix/vector library for C++ on a large distributed-memory multicomputer. The library is "self-optimising": by exploiting lazy evaluation, execution of matrix operations is delayed as much as possible. This exposes the context in which each intermediate result is used. The run-time system extracts a functional representation of the values being computed and optimises data distribution, grain size and scheduling prior to execution. This exploits results from the theory of program transformation for optimising parallel functional programs, while presenting an entirely conventional interface to the programmer. We present details of some of the simple optimisations we have implemented so far and illustrate their effect using a small example.

Conventionally, optimisation is confined to compile-time, and compilation is completed before run-time. Many exciting opportunities are lost by this convenient divide; this paper presents one example of such a possibility. We do optimisation at run-time for three important reasons:

- We wish to deliver a library which uses parallelism to implement ADTs efficiently, callable from any client program (in any sensible language) without special parallel programming expertise. This means we cannot perform compile-time analysis of the caller's source code.

- We wish to perform optimisations which take advantage of how the client program uses the intermediate values. This would be straightforward at compile-time, but is not for a library invoked at run-time.

- We wish to take advantage of information available only at run-time, such as the way operations are composed, and the size and characteristics of intermediate data structures.

We aim to get much of the performance of compile-time optimisation, and possibly more by using run-time information, while retaining the ease with which a library can be installed and used. There is some run-time overhead involved, which limits the scope of the approach.

1 Background: parallelism via libraries

Software engineers in computational science and engineering are not keen to invest in substantial restructuring of their applications in order to improve performance on parallel computers. Fully-automatic parallelisation is obviously an ideal, but an attractive alternative is to use a parallel library. This way, the complexity of using parallelism can be avoided completely: most of the application need not be changed at all, and, assuming proper inter-language working standards, the user can use any language and any compiler.

Host-cell data movement

A parallel program built this way has a characteristic structure: there is a single master processor (the "host") which runs the user's program, and a set of worker processors (the "cells") which are involved only in parallel operations. Interesting parallel library functions generally manipulate large data structures as operands and/or results, such as matrices, graphs and relations. When a parallel operation on such an ADT is called, the operand data must be partitioned and distributed to the cells. When the operation has finished, the host can assemble the result by combining fragments from each cell.

Redundant data movement

If the host needs to assemble the result matrix in order to print it, or to perform some test which will determine control flow, then this assembly stage is genuinely necessary. Very often, though, the result of a library operation is soon passed to another operator belonging to the same library. If this, too, is a parallel operation, then it would have been better to delay assembling the result, in case it turns out to be needed in distributed form by the operation which uses it.
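To make the redundant movement concrete, the sketch below shows a deliberately naive version of the host-cell call pattern just described. All of the names (Matrix, DistMatrix, scatter, gather, cell_multiply) are invented for illustration and stand in for whatever communication layer a real library would use.

    // Naive host-cell protocol (illustrative names only): every library
    // call scatters its operands to the cells and gathers its result back.
    #include <vector>

    struct Matrix     { std::vector<double> elems; }; // assembled on the host
    struct DistMatrix { std::vector<double> frag; };  // fragments on the cells

    // Stubs standing in for real host<->cell communication.
    DistMatrix scatter(const Matrix& m) { return {m.elems}; } // host -> cells
    Matrix gather(const DistMatrix& d)  { return {d.frag}; }  // cells -> host
    DistMatrix cell_multiply(const DistMatrix& a, const DistMatrix&)
    { return a; } // stub: the real version runs in parallel on the cells

    Matrix multiply(const Matrix& a, const Matrix& b) {
        DistMatrix da = scatter(a), db = scatter(b); // partition, distribute
        DistMatrix dc = cell_multiply(da, db);       // parallel operation
        return gather(dc); // wasted if the caller immediately passes the
    }                      // result to another parallel library operation

In a composition such as multiply(multiply(a, b), d), the intermediate product is gathered to the host only to be scattered straight back out again.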
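To give the flavour of a skeleton, the sketch below defines a data-parallel "map" as a higher-order C++ function template parameterised with user code. This is an illustration in modern C++, not SCL, and the names are invented; a real implementation would distribute the loop across the cells rather than run it sequentially.

    // Illustrative map skeleton: the library fixes the parallel structure,
    // the user supplies only the element-wise code.
    #include <cstddef>
    #include <vector>

    template <typename T, typename F>
    std::vector<T> map_skeleton(const std::vector<T>& xs, F user_code) {
        std::vector<T> ys(xs.size());
        for (std::size_t i = 0; i < xs.size(); ++i) // conceptually parallel
            ys[i] = user_code(xs[i]);
        return ys;
    }

    // Usage: each element is processed independently by the user's code.
    // auto scaled = map_skeleton(v, [](double x) { return 2 * x; });

Combining two such skeletons efficiently, so that the intermediate vector is not rebuilt between them, is exactly what the composition operators of an SCL are for.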
The objective of this work

The problem is that skeletons do not do what users want. Although the promised automatic performance tuning would be very valuable, it remains unattractive for some users if it can only be achieved by rewriting the outer control structure of the application in a functional language. In this paper we outline a scheme which uses essentially the same optimisation and program transformation techniques, but in which the library functions are simply called from the user's code. This makes it easy to parallelise an application in an incremental fashion, while aiming for the same performance advantage.

2 A run-time approach to the composition problem

We illustrate this idea using a parallel matrix library as an example. Our work has been based on C++, which is convenient because of its operator overloading. In principle any client language could have been used, since there is no compile-time processing.

Lazy data movement

The first inefficiency to eliminate is unnecessary movement of matrices to and from the host. To do this, we represent each matrix by an opaque "handle". The run-time system keeps track of whether the matrix is stored on the host or in the cells (or both), and if it is stored in the cells it also records how it is distributed. This way, a matrix result need never be assembled on the host. If the host program does refer to matrix elements individually (for example to print the matrix), communication will be required, and it may be more efficient to collect blocks of the matrix to reduce the number of message exchanges needed. If the distribution of a matrix happens to coincide with the way the following operation needs it, then data movement is avoided. If the data is held in the cells, but in some other distribution, the cells communicate among themselves to get the components they need. We return to this issue in section 4.

Lazy evaluation

We can often avoid unnecessary data movement by deciding the distribution of operands on the basis of how the results of an operation are to be used. To gain the opportunity to do this, we delay execution of library calls as long as possible. This is not apparent to the calling program, since the representation of the matrix is hidden behind the opaque "handle": we choose to represent the value in terms of the operations involved in computing it, that is, as a symbolic expression DAG. Evaluation cannot be delayed indefinitely:

- The matrix may be used in an operation which produces a non-opaque result, such as a scalar.

- The matrix may be required on the host, for printing or for a test which determines control flow.

We refer to both these cases as "strict" contexts. Evaluation is also forced if the expression DAG representing the value becomes too large.
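A minimal sketch of how such a handle might be organised follows. All of the names (Handle, Node, force, the Op and Location tags) are assumptions for illustration, not the library's real classes; the sketch simply combines the location record used for lazy data movement with the deferred expression node used for lazy evaluation.

    // Sketch of an opaque matrix handle (illustrative names). A handle
    // refers to a DAG node which either holds an evaluated matrix, with a
    // record of where it lives, or a recipe: an operation plus operands.
    #include <memory>
    #include <vector>

    enum class Location { Host, Cells, Both };
    enum class Op { Leaf, MatMul, ScalarAdd };

    struct Node {
        Op op = Op::Leaf;
        std::vector<std::shared_ptr<Node>> operands; // empty for a leaf
        bool evaluated = false;
        Location where = Location::Host; // meaningful once evaluated
        // (element storage and distribution descriptor omitted)
    };

    struct Handle { std::shared_ptr<Node> node; };

    // Building an expression merely builds a DAG node: nothing is moved
    // or computed yet, and the result gets a fresh handle.
    Handle operator*(const Handle& a, const Handle& b) {
        auto n = std::make_shared<Node>();
        n->op = Op::MatMul;
        n->operands = {a.node, b.node};
        return {n};
    }

    // Called in a strict context: choose distributions, schedule, execute.
    void force(const Handle& h) {
        (void)h; // (omitted) evaluate the operand DAG, then this node
    }

Because each operator returns a fresh handle, assigning a new value to a client variable never overwrites a matrix that an unevaluated recipe still refers to; this renaming is the subject of the next section.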
Dependences and name management

There is a problem with delaying evaluation: we are working with an imperative language with assignment. Is there not a danger that the operands will have changed by the time we actually do the calculation? This problem, essentially that of respecting anti- and output-dependences, i.e. write-after-read (WAR) and write-after-write (WAW) hazards, is solved using a simple technique analogous to the register renaming found in [Tomasulo, 1967]: each operation returns a new handle, referring to a distinct matrix result. Consider the example shown in Fig. 1. We have used overloading and templates in C++ to provide a concise and flexible syntax which is hopefully self-explanatory:

    #include <fstream.h>
    #include <matrix.h>

    void main()
    {
        matrix<double> a(500,500);
        matrix<double> b(500,500);
        matrix<double> c(500,500);
    S1: infile >> a;  // read a from file
    S2: infile >> b;  // read b from file
    S3: c = a * b;
    S4: a = a + 2;
    S5: outfile << a; // this forces evaluation of matrix a
    S6: outfile << c; // this forces evaluation of matrix c
    }

Figure 1: Evaluation of c can be delayed even though a is redefined.

In statement S1, the host reads the matrix and stores it on its heap. Client variable a contains a handle, h1, which refers to it. Statement S2 is similar, storing handle h2 in b. In S3, the matrix-matrix multiply operator returns a handle, say h3, in unevaluated form, and it is h3 which is stored in c. The handle points to a DAG node indicating that the value is to be computed by multiplying the matrices referred to by handles h1 and h2. In S4 the matrix-scalar add operator returns another handle, h4, representing the addition of 2 to each element of the matrix referred to by handle h1. Variable a now holds the new handle h4, but the matrix referred to by h1 is still available. In S5, evaluation of handle h4 is forced. In S6, evaluation of handle h3 is forced, picking up h1 and h2 despite a's new value.

Garbage collection

The renaming idea relies on keeping handles in case the client program should refer to them again. In general we have no control over what the client does with handles, which can be stored and copied at will, so we have a problem reclaiming the space occupied by the matrices or recipes they refer to. We do not have a general solution to this problem: we need a garbage collector. In C++, however, it is not hard to keep reference counts by exploiting the constructor/destructor mechanism, and this is adequate since there will be no cycles: the expression graphs are guaranteed to be acyclic, because each expression node can refer only to pre-existing nodes.
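A minimal sketch of such constructor/destructor reference counting follows, with invented names; the std::shared_ptr used in the earlier handle sketch gives the same behaviour in modern C++. Plain counting suffices here precisely because the expression graphs are acyclic.

    // Hand-rolled reference counting via constructors and destructors
    // (illustrative; releasing a node's operand references is omitted).
    struct RcNode {
        int refs = 0;
        // ... operation tag, operand pointers, evaluated data ...
    };

    class Handle {
        RcNode* node;
    public:
        explicit Handle(RcNode* n) : node(n) { ++node->refs; }
        Handle(const Handle& h) : node(h.node) { ++node->refs; } // copy: count up
        ~Handle() {                             // last handle gone:
            if (--node->refs == 0) delete node; // reclaim the node
        }
        Handle& operator=(const Handle& h) {    // re-point this handle
            ++h.node->refs;                     // increment first, so even
            if (--node->refs == 0) delete node; // self-assignment is safe
            node = h.node;
            return *this;
        }
    };

Client variables like a, b and c in Fig. 1 would hold such handles, so a recipe's storage is reclaimed as soon as the last handle referring to it, whether held by the client or by another DAG node, is destroyed.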
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 1995